Data Source

https://www.kaggle.com/datasets/aagambshah/lung-cancer-dataset

Data Description

This dataset contains responses from individuals who participated in a survey to identify behavioral and demographic factors associated with lung cancer. The dataset can be used for exploratory data analysis, statistical modeling, and machine learning classification tasks to predict lung cancer risk.

Process

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.metrics import accuracy_score, roc_auc_score

import torch
import torch.nn as nn
import torch.optim as optim

import data_analysis_tools as dat

Inspecting the Data

df = pd.read_csv('survey lung cancer.csv')
df.head()

The features have the following meanings:

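GENDER: respondent's gender (male/female)
AGE: respondent's age
SMOKING: smoking habit (yes/no)
YELLOW_FINGERS: yellowed fingers (yes/no)
ANXIETY: presence of anxiety (yes/no)
PEER_PRESSURE: has experienced peer pressure (yes/no)
CHRONIC DISEASE: existing chronic disease (yes/no)
FATIGUE: fatigue (yes/no)
ALLERGY: allergies (yes/no)
WHEEZING: wheezing symptoms (yes/no)
ALCOHOL CONSUMING: alcohol consumption habit (yes/no)
COUGHING: frequent coughing (yes/no)
SHORTNESS OF BREATH: shortness of breath (yes/no)
SWALLOWING DIFFICULTY: difficulty swallowing (yes/no)
CHEST PAIN: chest pain (yes/no)
LUNG_CANCER: lung cancer diagnosis (yes/no)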

df.describe()

df.isnull().sum()

GENDER 0
AGE 0
SMOKING 0
YELLOW_FINGERS 0
ANXIETY 0
PEER_PRESSURE 0
CHRONIC DISEASE 0
FATIGUE 0
ALLERGY 0
WHEEZING 0
ALCOHOL CONSUMING 0
COUGHING 0
SHORTNESS OF BREATH 0
SWALLOWING DIFFICULTY 0
CHEST PAIN 0
LUNG_CANCER 0
dtype: int64

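From the output above, the dataset has 15 features and 1 target (LUNG_CANCER), which makes this a binary classification problem.
The dataset has no missing values, and the yes/no survey columns are coded as 1 and 2 (1 = no, 2 = yes, consistent with the mapping used in the encoding step).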
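A quick way to confirm both observations (no missing cells, binary columns holding only 1 and 2) is to list the unique values per column. The sketch below runs on a small made-up frame named `toy`, not on the real CSV; the values are assumptions chosen only to mirror the column style.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical rows, same column style).
toy = pd.DataFrame({
    "GENDER": ["M", "F", "M", "F"],
    "AGE": [69, 74, 59, 63],
    "SMOKING": [1, 2, 1, 2],
    "LUNG_CANCER": ["YES", "YES", "NO", "NO"],
})

# Unique values per column: a yes/no survey column should contain only 1 and 2.
unique_values = {col: sorted(toy[col].unique().tolist()) for col in toy.columns}

# Total number of missing cells across the whole frame.
missing_total = int(toy.isnull().sum().sum())

print(unique_values)
print("missing cells:", missing_total)
```

Running the same two expressions on `df` would show whether any column contains an unexpected code before it is mapped.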

Feature Encoding

We re-encode the binary columns as 0 and 1.

df.columns.to_list()

['GENDER',
 'AGE',
 'SMOKING',
 'YELLOW_FINGERS',
 'ANXIETY',
 'PEER_PRESSURE',
 'CHRONIC DISEASE',
 'FATIGUE ',
 'ALLERGY ',
 'WHEEZING',
 'ALCOHOL CONSUMING',
 'COUGHING',
 'SHORTNESS OF BREATH',
 'SWALLOWING DIFFICULTY',
 'CHEST PAIN',
 'LUNG_CANCER']

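# Encode gender: M -> 1, F -> 0
df['GENDER'] = df['GENDER'].map({'M': 1, 'F': 0})

# Survey columns coded as 1/2; 'FATIGUE ' and 'ALLERGY ' keep the trailing space from the CSV header
columns = [
    'SMOKING',
    'YELLOW_FINGERS',
    'ANXIETY',
    'PEER_PRESSURE',
    'CHRONIC DISEASE',
    'FATIGUE ',
    'ALLERGY ',
    'WHEEZING',
    'ALCOHOL CONSUMING',
    'COUGHING',
    'SHORTNESS OF BREATH',
    'SWALLOWING DIFFICULTY',
    'CHEST PAIN',
]

# Recode 1 -> 0 and 2 -> 1
for column in columns:
    df[column] = df[column].map({1: 0, 2: 1})

# Encode the target: YES -> 1, NO -> 0
df['LUNG_CANCER'] = df['LUNG_CANCER'].map({'YES': 1, 'NO': 0})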
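Two details of this step are easy to trip over: some column names carry a trailing space, and `Series.map` silently produces NaN for any value that is missing from the dictionary. As an optional variation (not what the code above does), the names can be stripped first and the result checked for NaN afterwards. The frame `toy` below is a hypothetical three-row example.

```python
import pandas as pd

toy = pd.DataFrame({
    "GENDER": ["M", "F", "F"],
    "FATIGUE ": [2, 1, 2],  # trailing space, as in the raw CSV header
    "LUNG_CANCER": ["YES", "NO", "YES"],
})

# Strip stray whitespace so later code can use clean column names.
toy.columns = toy.columns.str.strip()

toy["GENDER"] = toy["GENDER"].map({"M": 1, "F": 0})
toy["FATIGUE"] = toy["FATIGUE"].map({1: 0, 2: 1})
toy["LUNG_CANCER"] = toy["LUNG_CANCER"].map({"YES": 1, "NO": 0})

# Any value not covered by a mapping dictionary would now be NaN.
unmapped = int(toy.isnull().sum().sum())

print(toy)
print("unmapped cells:", unmapped)
```

If `unmapped` is greater than zero on the real data, some cell held a value outside the expected codes.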

EDA

dat.plot_all_barplots(df, hue='LUNG_CANCER')

The plots show that people with lung cancer score higher on every feature than people without it.

df.corr()

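In the correlation matrix above, GENDER and ALCOHOL CONSUMING have a relatively high correlation coefficient, so we look at how they relate:

sns.barplot(data=df, x='GENDER', y='ALCOHOL CONSUMING', hue='LUNG_CANCER')
plt.show()

Among men, those who do not drink are less likely to have lung cancer than those who drink.
Among women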
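The bar plot draws the mean of ALCOHOL CONSUMING for each gender and diagnosis group. To read the lung-cancer rate for each gender and drinking status directly, one option is a grouped mean of the target. This is a minimal sketch on eight made-up rows (`toy` is hypothetical and uses the 0/1 encoding from the previous section), so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows: GENDER 1 = male, 0 = female; other columns 1 = yes, 0 = no.
toy = pd.DataFrame({
    "GENDER":            [1, 1, 1, 1, 0, 0, 0, 0],
    "ALCOHOL CONSUMING": [1, 1, 0, 0, 1, 1, 0, 0],
    "LUNG_CANCER":       [1, 1, 1, 0, 1, 0, 1, 1],
})

# Mean of a 0/1 target within each group is the share of positive cases.
rate_table = (
    toy.groupby(["GENDER", "ALCOHOL CONSUMING"])["LUNG_CANCER"]
    .mean()
    .unstack()  # rows: GENDER, columns: ALCOHOL CONSUMING
)

print(rate_table)
```

Each cell of `rate_table` is the fraction of rows with LUNG_CANCER equal to 1 in that gender and drinking group.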

feature_importance = df.corr()['LUNG_CANCER'].sort_values(ascending=False)
feature_importance = feature_importance[1:]
plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importance.values, y=feature_importance.index)
plt.title('Feature Importance')
plt.xlabel('Correlation with Lung Cancer')
plt.ylabel('Features')
plt.show()
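Sorting the raw correlations in descending order places strongly negative features at the bottom, even though they may be as informative as strongly positive ones. A possible variation, sketched here on a made-up frame with hypothetical column names, ranks by absolute correlation and drops the target by name instead of by position.

```python
import pandas as pd

# Hypothetical data: FEATURE_A tracks the target, FEATURE_B runs against it.
toy = pd.DataFrame({
    "FEATURE_A":   [1, 1, 1, 0, 0, 0],
    "FEATURE_B":   [0, 0, 1, 1, 1, 1],
    "FEATURE_C":   [1, 0, 1, 0, 1, 0],
    "LUNG_CANCER": [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, target itself removed by name.
corr = toy.corr()["LUNG_CANCER"].drop("LUNG_CANCER")

# Order features by the magnitude of the correlation, keeping the sign in the values.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked)
```

In this toy example FEATURE_B ranks second despite its negative sign, whereas a plain descending sort would put it last.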

Splitting the Data

X = df.drop(['LUNG_CANCER'], axis=1)
y = df['LUNG_CANCER']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
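When one class is rare, a plain random split can leave very few minority samples in the test set. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses made-up labels (90 positives, 10 negatives) rather than the survey data; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 positives and 10 negatives.
y_demo = np.array([1] * 90 + [0] * 10)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 90/10 ratio in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

neg_in_test = int((y_te == 0).sum())
neg_in_train = int((y_tr == 0).sum())
print("negatives in test:", neg_in_test, "| negatives in train:", neg_in_train)
```

Stratifying does not create more minority samples, it only guarantees that the split does not lose them by chance.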

Standardization


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
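The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into preprocessing. The following sketch, on random made-up arrays, shows that the stored mean comes from the training data and that only the training part ends up with exactly zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (for example an age-like column and one more).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=60.0, scale=8.0, size=(80, 2))
X_test_demo = rng.normal(loc=60.0, scale=8.0, size=(20, 2))

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean and std here
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training statistics

print("stored mean:", scaler_demo.mean_)
print("test mean after scaling:", X_test_scaled.mean(axis=0))
```

The scaled test columns are close to, but not exactly, zero mean, which is expected because they were never used to fit the scaler.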

Building the Models

Random Forest

rf = RandomForestClassifier(n_estimators=100, random_state=42)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.98      0.98      0.98        60

    accuracy                           0.97        62
   macro avg       0.74      0.74      0.74        62
weighted avg       0.97      0.97      0.97        62
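With only 2 negative samples among 62, the 0.97 accuracy is dominated by the positive class. The imports at the top include `confusion_matrix`, which makes this visible. The sketch below does not use the real predictions: `y_true_demo` and `y_pred_demo` are arrays rebuilt by hand so that their counts match the report (1 of 2 negatives correct, 59 of 60 positives correct).

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Illustrative labels matching the counts implied by the report above.
y_true_demo = np.array([0, 0] + [1] * 60)
y_pred_demo = np.array([0, 1] + [0] + [1] * 59)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

acc = accuracy_score(y_true_demo, y_pred_demo)
# Balanced accuracy averages the recall of each class, so both classes weigh equally.
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(f"accuracy: {acc:.4f}, balanced accuracy: {bal_acc:.4f}")
```

On these counts the balanced accuracy is about 0.74, the same figure as the macro-average recall in the report, which is a more cautious summary than 0.97.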

Neural Network

# Convert to NumPy arrays
X = df.drop(columns=['LUNG_CANCER']).values.astype(np.float32)
y = df['LUNG_CANCER'].values.astype(np.float32)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_test = scaler.transform(X_test).astype(np.float32)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)  # add a dimension: shape (N, 1)

# Test tensors (y_test is already a NumPy array at this point)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)  # add a dimension: shape (N, 1)


class BinaryClassifier(nn.Module):
    def __init__(self, input_dim):
        super(BinaryClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, 16)  # first hidden layer
        self.fc2 = nn.Linear(16, 8)          # second hidden layer
        self.fc3 = nn.Linear(8, 1)           # output layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))     # ReLU activation
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))  # sigmoid activation, outputs a probability
        return x

# Initialize the model
input_dim = X_train.shape[1]
model = BinaryClassifier(input_dim)

# Define the loss function and optimizer
criterion = nn.BCELoss()  # binary cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
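The model applies `torch.sigmoid` in `forward` and pairs it with `nn.BCELoss`. PyTorch also provides `nn.BCEWithLogitsLoss`, which takes the raw output of the last linear layer and applies the sigmoid internally in a numerically more stable way. The sketch below is not part of the original pipeline; it only checks, on random made-up logits, that the two formulations give the same loss value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical raw outputs of a final linear layer and matching 0/1 targets.
logits = torch.randn(8, 1)
targets = torch.randint(0, 2, (8, 1)).float()

# Variant used in this post: explicit sigmoid followed by BCELoss.
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Alternative: pass the logits directly, the sigmoid is applied inside the loss.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_from_probs.item(), loss_from_logits.item())
```

Switching to the logits variant would mean removing the sigmoid from `forward` and applying `torch.sigmoid` only when probabilities are needed at prediction time.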


# Training parameters
epochs = 50
batch_size = 32

# Training loop
for epoch in range(epochs):
    model.train()  # switch to training mode
    running_loss = 0.0

    # Mini-batch training
    for i in range(0, len(X_train_tensor), batch_size):
        inputs = X_train_tensor[i:i+batch_size]
        labels = y_train_tensor[i:i+batch_size]

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    # running_loss is the sum of the batch losses in this epoch, not their mean
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss:.4f}")

Epoch [1/50], Loss: 5.3979
Epoch [2/50], Loss: 5.2241
Epoch [3/50], Loss: 5.0661
Epoch [4/50], Loss: 4.9229
Epoch [5/50], Loss: 4.7911
Epoch [6/50], Loss: 4.6619
Epoch [7/50], Loss: 4.5310
Epoch [8/50], Loss: 4.3962
Epoch [9/50], Loss: 4.2550
Epoch [10/50], Loss: 4.1069
Epoch [11/50], Loss: 3.9532
Epoch [12/50], Loss: 3.7950
Epoch [13/50], Loss: 3.6341
Epoch [14/50], Loss: 3.4730
Epoch [15/50], Loss: 3.3150
Epoch [16/50], Loss: 3.1634
Epoch [17/50], Loss: 3.0199
Epoch [18/50], Loss: 2.8862
Epoch [19/50], Loss: 2.7630
Epoch [20/50], Loss: 2.6512
Epoch [21/50], Loss: 2.5504
Epoch [22/50], Loss: 2.4604
Epoch [23/50], Loss: 2.3804
Epoch [24/50], Loss: 2.3079
Epoch [25/50], Loss: 2.2413
…
Epoch [47/50], Loss: 1.3242
Epoch [48/50], Loss: 1.3006
Epoch [49/50], Loss: 1.2785
Epoch [50/50], Loss: 1.2578
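The logged value is `running_loss`, the sum of the mini-batch losses in an epoch, so it depends on how many batches there are. Dividing by the batch count gives a per-batch average that is easier to interpret. The arithmetic below assumes roughly 248 training rows, estimated from the 62 test rows being 20% of the data; that row count is an estimate, not a figure stated in the post.

```python
import math

batch_size = 32
test_rows = 62                       # support shown in the classification reports
train_rows_estimate = test_rows * 4  # assumption: an 80/20 split, about 248 training rows

# Number of mini-batches per epoch, the last one being smaller than batch_size.
batches_per_epoch = math.ceil(train_rows_estimate / batch_size)

first_epoch_avg = 5.3979 / batches_per_epoch  # summed loss logged for epoch 1
last_epoch_avg = 1.2578 / batches_per_epoch   # summed loss logged for epoch 50

# Binary cross-entropy of always predicting 0.5, for comparison.
chance_level = math.log(2)

print(batches_per_epoch, round(first_epoch_avg, 4), round(last_epoch_avg, 4), round(chance_level, 4))
```

Under this assumption the first epoch averages about 0.67 per batch, close to the 0.69 of an uninformed 0.5 prediction, and the last epoch averages about 0.16.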

# Evaluate the model on the test set
model.eval()  # switch to evaluation mode
with torch.no_grad():
    y_pred_prob = model(X_test_tensor).numpy().flatten()  # predicted probabilities
    y_pred = (y_pred_prob > 0.5).astype(int)              # convert to class labels

# Compute accuracy and ROC AUC
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_prob)

print("Classification report:")
print(classification_report(y_test, y_pred))

print(f"Test accuracy: {accuracy:.4f}")
print(f"Test ROC AUC: {auc:.4f}")

Classification report:

              precision    recall  f1-score   support

         0.0       0.50      0.50      0.50         2
         1.0       0.98      0.98      0.98        60

    accuracy                           0.97        62
   macro avg       0.74      0.74      0.74        62
weighted avg       0.97      0.97      0.97        62

Test accuracy: 0.9677
Test ROC AUC: 0.9500
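Both models are scored on a test set with only 2 negative samples, so a single prediction changes the class-0 metrics by 50 points. Stratified k-fold cross-validation uses every row for testing once and gives a spread of scores instead of one number. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` with a similar class imbalance, not the survey data, and a random forest with the same settings as above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 300 rows, 15 features, roughly 13% negatives.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, weights=[0.13, 0.87], random_state=42
)

# Each of the 5 folds keeps the same class ratio as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# One ROC AUC value per fold.
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`; the standard deviation across folds indicates how much a single 62-row test score can be trusted.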

智浩的Blog
